As part of the DAND program, the EDA project asks the student to analyse one of the following datasets:
* White Wine quality dataset
* Red Wine quality dataset
* Financial contributions to presidential campaigns by state
* Loan data from Prosper
* Student dataset
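The dataset chosen is the Red Wine dataset, which encompasses 11 input variables and 1 ‘output’ variable. The red and white datasets relate to the red and white variants of the Portuguese “Vinho Verde” wine.
The dataset is composed of 1,599 observations. Each observation comes with the following variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality.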
Wine making is a complex process that involves far more than the chemical compounds of the wine itself. Environmental variables linked to the vineyard (year, temperature, location) and choices made by the winemaker (vinification, age of the wine and so on) could also influence the wine.
Last but not least, the quality is rated by 3 wine experts. Given the wide range of tastes, experience and trends in the wine market, the rating could change over time.
Nevertheless, the experiment remains meaningful and will shed light on which of the measured variables influence the decision of the wine experts.
The data analysis will be ‘guided’ by the following question: Which chemical properties influence the quality of red wines?
To answer this question, different prediction models will be explored to see which fits the data best. The dataset description already tells us which model fits best: “Several data mining methods were applied to model these datasets (i.e. red and white wine dataset) under a regression approach. The support vector machine model achieved the best results.”
By the end of the analysis, we expect to know which variables do and do not influence the quality of the wine.
The analysis will be performed only on the provided variables and, if any, engineered variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity follows a roughly bell-shaped, slightly right-skewed distribution, with a median of 7.9 g / dm^3 and a mean of 8.32 g / dm^3.
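The skew can also be read off the quartiles printed in the summary above, without looking at the histogram. The sketch below uses the quartile (Bowley) skewness, which is a measure I am bringing in for illustration and not something computed in the original analysis; the function name `bowley_skewness` is my own.

```r
# Quartile (Bowley) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).
# Positive values indicate a longer right tail, negative a longer left tail.
bowley_skewness <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles of fixed acidity as printed in the summary above (g / dm^3).
fixed_acidity_skew <- bowley_skewness(q1 = 7.10, med = 7.90, q3 = 9.20)
round(fixed_acidity_skew, 3)
```

The result is positive (0.5 / 2.1, about 0.24), which agrees with the mean sitting above the median.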
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity follows a slightly right-skewed distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
This distribution has many peaks, and it is difficult to tell from the visualization alone what the mean or median citric acid level would be.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Many bins are empty. This could be solved by adjusting the bin width of the plot, but I decided to leave it as is: most likely the measurement method was not precise enough to resolve the data at the resolution of the histogram.
Nevertheless, the distribution looks slightly right-skewed, with a median at 2.2 g / dm^3 and a mean at 2.539 g / dm^3.
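To illustrate why empty bins appear, here is a minimal sketch with made-up values, assuming the measurement is recorded in steps of 0.1 (an assumption, not something stated in the dataset description). Bins narrower than the recording step are necessarily empty in between; bins matching the step are not.

```r
# Illustrative values recorded to one decimal place (not the real data).
sugar <- c(1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.6)

# Bins of width 0.05, finer than the assumed 0.1 recording step.
fine_breaks <- 1.875 + 0.05 * (0:15)
fine <- hist(sugar, breaks = fine_breaks, plot = FALSE)

# Bins of width 0.1, matching the assumed recording step.
coarse_breaks <- 1.85 + 0.1 * (0:8)
coarse <- hist(sugar, breaks = coarse_breaks, plot = FALSE)

sum(fine$counts == 0)    # number of empty bins at width 0.05
sum(coarse$counts == 0)  # number of empty bins at width 0.1
```

With the finer breaks, every other bin is empty, which is the pattern seen in the plot.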
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chloride measurements follow a roughly normal distribution with a long right tail: the maximum (0.611) lies far above the third quartile (0.09).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide follows a right-skewed distribution, with a median around 14 and a mean around 16.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.00 22.00 38.00 46.43 62.00 289.00 2
Total sulfur dioxide follows a right skewed distribution with a median measured at 38 mg / dm^3 and a mean around 46 mg / dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density follows a normal distribution centred just below 1 g / cm^3, the density of water.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH follows a normal distribution, with median and mean both around 3.3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates follow a slightly right-skewed distribution, with a median around 0.62 g / dm^3 and a mean around 0.66 g / dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol follows a slightly right-skewed distribution, with a mean around 10.4 % and a median around 10.2 %.
The plot below shows the distribution of wine ratings.
Alcohol and density are highly correlated, which I guess is expected as ethanol is lighter than water.
One of the strong relationships found is between fixed acidity, citric acid, density and pH. This relationship seems logical, as these variables measure closely related characteristics.
To help identify the significant variables, we plot the correlations using the corrplot function:
The result shows an issue in the data with total sulfur dioxide.
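The kind of issue that missing values cause in a correlation matrix can be reproduced with base R alone. The sketch below uses a small made-up table (the values are illustrative, not taken from the dataset) and only `cor()`; the resulting matrix is what a function such as corrplot would then take as input.

```r
# Small made-up table with one missing value (not the real data).
wine <- data.frame(
  alcohol              = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  density              = c(0.9978, 0.9970, 0.9966, 0.9958, 0.9950, 0.9943),
  total.sulfur.dioxide = c(34, 67, NA, 60, 28, 21)
)

# Default behaviour: a pair involving a column with an NA comes back as NA.
cor_default <- cor(wine)

# Use only the rows where both variables of a pair are present.
cor_pairwise <- cor(wine, use = "pairwise.complete.obs")

cor_default["alcohol", "total.sulfur.dioxide"]   # missing
cor_pairwise["alcohol", "total.sulfur.dioxide"]  # a finite number
```

Pairwise deletion is only one option; cleaning the column first, as discussed next, is another.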
##
## FALSE TRUE
## 1597 2
The table of NAs shows only 2 missing values. With so few missing, an appropriate cleaning strategy could be either to remove them or to replace them with the sample mean.
The latter strategy will be used.
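A minimal sketch of mean imputation, on a made-up vector (the values and the names `total_so2` and `total_so2_filled` are illustrative, not the original code):

```r
# Made-up measurements with two missing values (not the real data).
total_so2 <- c(34, 67, NA, 60, 28, NA, 21)

# Count the missing values, in the same form as the table above.
table(is.na(total_so2))

# Replace each NA with the mean of the observed values.
observed_mean <- mean(total_so2, na.rm = TRUE)
total_so2_filled <- ifelse(is.na(total_so2), observed_mean, total_so2)
```

Mean imputation leaves the sample mean unchanged but slightly reduces the variance; with only 2 values out of 1,599 the effect is negligible.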
The correlation plot shows two small groups of interacting variables:
* Fixed acidity, citric acid and pH: all of these variables relate to the acidity of the wine.
* Density and alcohol: alcohol has a different density than water, which influences the measured density.
##
## Call:
## lm(formula = quality ~ ., data = DT.redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.415e-12 -5.000e-16 1.090e-15 2.420e-15 5.226e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000e+00 1.184e-12 2.533e+12 <2e-16 ***
## fixed.acidity 5.008e-16 1.449e-15 3.460e-01 0.730
## volatile.acidity -8.778e-15 6.946e-15 -1.264e+00 0.207
## citric.acid 1.105e-14 8.191e-15 1.349e+00 0.178
## residual.sugar 9.324e-16 8.390e-16 1.111e+00 0.267
## chlorides -1.032e-14 2.341e-14 -4.410e-01 0.659
## free.sulfur.dioxide 4.665e-17 1.219e-16 3.830e-01 0.702
## total.sulfur.dioxide -1.712e-17 4.128e-17 -4.150e-01 0.678
## density -1.644e-12 1.210e-12 -1.359e+00 0.174
## pH -8.170e-15 1.072e-14 -7.620e-01 0.446
## sulphates 4.634e-15 6.504e-15 7.130e-01 0.476
## alcohol 2.360e-16 1.519e-15 1.550e-01 0.877
## quality_F4 1.000e+00 1.248e-14 8.016e+13 <2e-16 ***
## quality_F5 2.000e+00 1.167e-14 1.714e+14 <2e-16 ***
## quality_F6 3.000e+00 1.175e-14 2.554e+14 <2e-16 ***
## quality_F7 4.000e+00 1.211e-14 3.303e+14 <2e-16 ***
## quality_F8 5.000e+00 1.466e-14 3.410e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.592e-14 on 1580 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.045e+28 on 16 and 1580 DF, p-value: < 2.2e-16
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
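A linear model has been fitted to the data; the summary above lists the results (three stars *** mean high statistical significance). Note that the printed summary comes from the formula quality ~ ., which also includes the engineered factor quality_F. Because that factor encodes the rating itself, the fit is perfect (R-squared of 1) and the chemical variables carry no significance in that particular output, so the significance levels discussed here only make sense for a fit that leaves quality_F out of the formula.
Among these, the three most interesting variables to investigate further are volatile acidity, sulphates and alcohol.
The study shows a highly significant relationship between the ratings and the measured levels of volatile acidity, chlorides, total sulfur dioxide, sulphates and alcohol.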
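In the summary printed above, quality_F behaves like a factor copy of the rating (intercept 3, levels 4 to 8 adding exactly 1 to 5), which is enough to explain the R-squared of 1. The sketch below reproduces the effect on made-up data with a single hypothetical predictor, and shows one way to keep the engineered factor out of the formula. The data, the seed and the object names are my own assumptions, not the original code.

```r
# Made-up data (not the real dataset): quality depends loosely on alcohol.
set.seed(1)
n <- 200
alcohol <- runif(n, 8.5, 14)
raw_score <- 3 + 0.4 * (alcohol - 8.5) + rnorm(n, sd = 0.8)
quality <- pmin(8, pmax(3, round(raw_score)))
wine <- data.frame(alcohol = alcohol, quality = quality,
                   quality_F = factor(quality))

# `quality ~ .` also uses quality_F, which encodes the response itself.
fit_leaky <- lm(quality ~ ., data = wine)

# Removing the engineered factor leaves only the measured variable.
fit_clean <- lm(quality ~ . - quality_F, data = wine)

# summary() warns about an essentially perfect fit for the leaky model.
r2_leaky <- suppressWarnings(summary(fit_leaky)$r.squared)
r2_clean <- summary(fit_clean)$r.squared
c(leaky = r2_leaky, clean = r2_clean)
```

An alternative with the same effect is to drop the column from the data before fitting, or to list the measured variables explicitly in the formula.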
The two following plots highlight the significance of the measures for the quality ratings.
High significance variables:
Medium significance:
Non-significant variable:
This plot is very interesting, as it clearly shows a threshold in the measured volatile acidity: around 0.4 g / dm^3, most wines are rated between 7 and 8. It would be interesting to see why low acidity brings a better rating; one hypothesis is that the wine ages well and its acidity decreases over time.
It is important to plot the data to see the relationships between variables. Here, the visualization of total sulfur dioxide against alcohol clearly shows two distinct clusters: the wines rated low to average (3 to 5) and those rated 6 and above.
For the final plot, I chose this violin plot because I think we can learn something very interesting from it just by looking at it.
The visualization shows that wines below 11.5 % alcohol are usually rated lower than those above. Next time you choose a bottle, don’t hesitate to check the alcohol content. While many other factors influence the expert ratings, most of the poorly rated wines (5 and below) are below the 11.5 % threshold, and most of the wines above 12 % are rated 6 and above.
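The same reading can be backed by a simple count. The sketch below uses ten made-up observations (not the real data) and hypothetical column names to compute the share of wines rated 6 or more on each side of the 11.5 % threshold.

```r
# Made-up observations (not the real data).
wine <- data.frame(
  alcohol = c(9.2, 9.8, 10.1, 10.9, 11.2, 11.8, 12.1, 12.5, 13.0, 13.4),
  quality = c(5, 5, 4, 6, 5, 6, 7, 5, 7, 8)
)

# Flag each wine by alcohol band and by rating class.
wine$alcohol_band <- ifelse(wine$alcohol < 11.5, "below 11.5%", "11.5% and above")
wine$rated_6_plus <- wine$quality >= 6

# Share of wines rated 6 or more within each alcohol band.
share_6_plus <- tapply(wine$rated_6_plus, wine$alcohol_band, mean)
share_6_plus
```

On the real dataset, the same two lines of flagging and the `tapply()` call would give the actual shares behind the violin plot.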
Plotting so many different variables and values was a bit tedious: defining harmonized colors, scales and titles for all the plots took a long time. I first submitted this project without working on the scales; after adjusting all of them, the plots made much more sense.
The big success is that someone can really read the data through the graphs, with their scales and colors, and identify intuitively which values contribute to the rating of the wine. This should be taken with caution, though, as we don’t know when the data was collected and the wines rated. Wine taste evolves with time and follows industry trends. For example, a wine rated 8 in 2000 might receive a different rating in 2017, even with exactly the same taste.
The analysis also highlighted for me the importance of plotting the data. The differences between multivariate, bivariate and univariate plots are self-explanatory, and I’ll definitely follow this route in my next analyses.
One part that I think should be developed further in the machine learning section is the linear regression: it takes a more mathematical approach, can really guide the analysis, and highlights the level of statistical significance of each variable.
Sharing this analysis with wine experts or chemists would be of great help in understanding the relationships in the data. Ethanol, for example, is lighter than water and does affect density; but to what extent do the other variables affect density? Some expert knowledge could be used to simplify the models or to provide theoretical values to compare against.
It would have been interesting to plot the sulphur dioxide levels against the age of the wine and the quality rating; unfortunately, the age is missing from the dataset.
According to https://www.wineselectors.com.au/selector-magazine/wine/wine-101/preserving-the-truth-on-sulphates-in-wine, sulphur dioxide levels decrease with age as the gas dissipates over time.